
Performance Improvements in Julia (Part 4): Pre-Allocations

Martin Alfaro - PhD in Economics
July 3, 2023
Remark
The script in this post is available here under the name allCode.jl. It's been tested under Julia 1.9.

For more accurate benchmarks, each variable is wrapped through the function ref(x) = (Ref(x))[] and interpolated with $.
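As a brief illustration of this convention, the sketch below shows that ref is an identity function: it returns the very same object it receives, but hides it behind a Ref on the way (the variable names here are only illustrative).

```julia
# Benchmarking convention (as stated in the remark above):
# wrap the value in a Ref, then immediately unwrap it.
ref(x) = (Ref(x))[]

x = rand(100)

# ref is an identity: it returns the very same object.
ref(x) === x    # true
```

Inside @btime, writing e.g. sum(ref($x)) hides the interpolated value behind a Ref, preventing the compiler from constant-folding it and skewing the measurement.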

Introduction

This is the 4th part of a series on performance improvements in Julia that are easy to implement. The previous parts covered lazy broadcasting, tuples/static vectors, and views. This part discusses the pre-allocation of outputs.

The technique addresses the memory allocation that comes with creating new objects. It leverages the fact that mutating the contents of an existing object does not allocate, allowing us to avoid creating a new object in every iteration of a loop. Although we'll illustrate the concepts through vectors, the same principle applies to other objects. [note] Notice that pre-allocating outputs provides no benefit if we only need to compute a result once, as we always need to create a new object to start with.
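The fact that mutation avoids allocation can be checked directly with Base's @allocated macro. The sketch below (function names are illustrative, not from the post's script) contrasts creating a fresh vector with writing into an existing one:

```julia
make_new(x)          = x ./ 2                  # allocates a fresh vector
fill_existing!(y, x) = (y .= x ./ 2; nothing)  # writes into existing memory

function allocation_demo(x)
    y = similar(x)
    make_new(x); fill_existing!(y, x)        # warm-up so compilation is excluded
    a_new = @allocated make_new(x)           # > 0: a new array is created
    a_mut = @allocated fill_existing!(y, x)  # 0: no new memory is requested
    return (a_new, a_mut)
end

allocation_demo(rand(100))
```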

The main takeaway from the post is that pre-allocations can greatly enhance performance. However, they can also hinder code readability. You should therefore decide whether implementing them pays off by weighing the performance gains against the added code complexity.

Preliminaries

The packages used in this post are the following.

Packages and Variables
using BenchmarkTools, Random, Statistics

Random.seed!(1234)
x = rand(100)

We'll suppose that our ultimate goal is to store the result of a computation in the vector output, and that obtaining it requires computing a vector temp multiple times. Notice that pre-allocation offers no advantage when temp only operates on single elements, even if the operations are performed multiple times. The reason is that single elements (scalars) do not allocate memory. For example, temp in the following code only involves manipulating single elements of x, so no allocation arises from it.

Single Elements Do Not Allocate
function example1(x)
    output = similar(x)    
    
    for i in eachindex(x)
        temp      = x[1] / x[i]
        output[i] = mean(temp)
    end

    return output
end

@btime example1(ref($x));
Output in REPL
  80.864 ns (1 allocation: 896 bytes)

The only allocation arising in the example is due to the creation of the vector output. Since we assume the user wants to store output, this allocation is unavoidable.

Pre-Allocations

Based on the above observations, our focus will be on a vector temp. To keep the code simple, we also initialize output by introducing it as a keyword argument. [note] The keyword argument similar(x) uses the function's argument x. This is possible in Julia as a function's arguments are evaluated sequentially, allowing the user to define a keyword argument based on any argument defined before.
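As a minimal sketch of this evaluation order (f is a hypothetical function, not part of the post's script), a keyword argument's default may refer to any argument declared before it:

```julia
# The default of n is computed from the positional argument x,
# which works because a function's arguments are evaluated left to right.
f(x; n = length(x)) = n

f([10, 20, 30])          # 3: default computed from x
f([10, 20, 30]; n = 5)   # 5: an explicit value overrides the default
```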

In particular, we'll consider the following baseline example.

Baseline Example
function example2(x; output=similar(x))     
    for i in eachindex(x)
        temp      = x ./ x[i]
        output[i] = mean(temp)
    end

    return output
end

@btime example2(ref($x));
Output in REPL
  5.540 μs (101 allocations: 88.38 KiB)

Out of the 101 allocations, one corresponds to the creation of output and hence will always occur. The remaining 100 allocations arise because a new vector temp is created in each iteration of the for-loop (recall that the length of x is 100, which defines the number of iterations).

We can avoid these 100 allocations by pre-allocating temp. This requires creating temp before the loop and then updating its contents within the loop. The first example below demonstrates an incorrect use of pre-allocation, where .= is omitted. The other two examples represent equivalent ways to perform the pre-allocation correctly.

function compute_output1(x; output=similar(x)) 
    temp = similar(x)

    for i in eachindex(x)
        temp      = x ./ x[i]
        output[i] = mean(temp)
    end

    return output
end

@btime compute_output1(ref($x));
Output in REPL
  5.640 μs (102 allocations: 89.25 KiB)
function compute_output2(x; output=similar(x)) 
    temp = similar(x)

    for i in eachindex(x)
        temp      .= x ./ x[i]
        output[i]  = mean(temp)
    end

    return output
end

@btime compute_output2(ref($x));
Output in REPL
  4.143 μs (2 allocations: 1.75 KiB)
function compute_output3(x; output=similar(x)) 
    temp = similar(x)

    for i in eachindex(x)
        @. temp   = x / x[i]
        output[i] = mean(temp)
    end

    return output
end

@btime compute_output3(ref($x));
Output in REPL
  4.057 μs (2 allocations: 1.75 KiB)

The use of the macro @. in compute_output3 is the easiest way to perform the pre-allocation of temp. This macro adds a dot to each operator in the expression, including =, thereby ensuring we don't omit any dot unintentionally.
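To make the expansion concrete, the sketch below (with illustrative variables) verifies that the @. form computes the same result as the manually dotted version used in compute_output2:

```julia
x    = [1.0, 2.0, 4.0]
temp = similar(x)
i    = 1

# @. dots every operator and turns = into .=, so this line
# is equivalent to `temp .= x ./ x[i]`.
@. temp = x / x[i]

temp == x ./ x[i]    # true: identical to the explicitly dotted version
```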

Remark
Pre-allocations and in-place operations are closely related. In fact, pre-allocating temp in the previous example is equivalent to performing an in-place operation. The following example shows this equivalence, where the contents of temp are updated in each iteration by calling preallocated!.

Pre-Allocation as an In-Place Operation
function preallocated!(temp, x, i)
    for j in eachindex(x)
        temp[j] = x[j] / x[i]
    end    
end

function compute_output4(x; output=similar(x)) 
    temp = similar(x)

    for i in eachindex(x)
        preallocated!(temp, x, i)
        output[i] = mean(temp)
    end

    return output
end

@btime compute_output4(ref($x));

Repeated Application of Pre-Allocation

The function compute_output3 in the last example reduces the number of allocations to only 2, corresponding to the creation of the vectors output and temp.

The approach is useful if output is the final result we're interested in. However, output could itself be used as an input in another loop, in which case the 2 allocations would be repeated multiple times. To show this, consider the following example.

Output Used Multiple Times
function intermediate_result(x; output=similar(x))
    temp = similar(x)

    for i in eachindex(x)
        @. temp   = x / x[i]
        output[i] = mean(temp)
    end

    return output
end

final_result(x) = [intermediate_result(x) for _ in 1:100]

@btime final_result(ref($x));
Output in REPL
  410.400 μs (201 allocations: 175.88 KiB)

The example shows that the 2 allocations are repeated 100 times, making a total of 201 allocations (200 from the loop iterations plus 1 from creating the vector in final_result itself). However, we can further reduce the number of allocations by creating temp and output once in final_result and passing them to intermediate_result.

Output Used Multiple Times
function intermediate_result(x, output, temp)
    for i in eachindex(x)
        @. temp   = x / x[i]
        output[i] = mean(temp)
    end

    return output
end

function final_result(x)
    output = similar(x)
    temp   = similar(x)
    [intermediate_result(x, output, temp) for _ in 1:100]
end

@btime final_result(ref($x));
Output in REPL
  403.800 μs (3 allocations: 2.62 KiB)

Some Conclusions

The last example reveals that pre-allocating outputs can make the code quite cumbersome. This means you should assess whether the performance improvement is worth the extra complexity. To see this clearly, compare the code above with the following, which produces an identical result:

Simple Code
intermediate_result(x) = mean.(x ./ x[i] for i in eachindex(x))
final_result(x)        = [intermediate_result(x) for _ in 1:100]

@btime final_result(ref($x))
Output in REPL
  543.900 μs (10201 allocations: 8.72 MiB)

Consequently, you should first consider alternative options to pre-allocations when possible. For example, depending on the operation to perform, the use of lazy broadcasting or static vectors may keep the code simpler while achieving similar performance.
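For instance, one allocation-free alternative (a sketch of the idea, not taken from the original post) replaces temp with a generator, which mean consumes lazily without ever materializing an intermediate vector:

```julia
using Statistics

function compute_output_lazy(x)
    output = similar(x)
    for i in eachindex(x)
        # the generator is consumed element by element; no temp vector is built
        output[i] = mean(x[j] / x[i] for j in eachindex(x))
    end
    return output
end

compute_output_lazy(rand(100))   # only output itself is allocated
```

This keeps the single-allocation profile of the pre-allocated versions while reading much closer to the simple code above.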

Finally, notice that we've considered examples where the vector's length is 100, which explains why the time differences between the approaches are small. The performance gains from pre-allocation become more pronounced as the vector's length grows. Indeed, when we increase the length to 10,000, the differences in time become evident, as the following example demonstrates.

Random.seed!(1234)
x = rand(10_000)

intermediate_result(x) = mean.(x ./ x[i] for i in eachindex(x))
final_result(x)        = [intermediate_result(x) for _ in 1:100]

@btime final_result(ref($x))
Output in REPL
  71.684 s (2000401 allocations: 74.57 GiB)
Random.seed!(1234)
x = rand(10_000)
    
function intermediate_result(x, output, temp)
    for i in eachindex(x)
        @. temp   = x / x[i]
        output[i] = mean(temp)
    end

    return output
end

function final_result(x)
    output = similar(x)
    temp   = similar(x)
    [intermediate_result(x, output, temp) for _ in 1:100]
end

@btime final_result(ref($x));
Output in REPL
  4.723 s (5 allocations: 157.22 KiB)